\[ \mathbb{E}\left (y | \mathbf{X}\right ) = \mu = g^{-1}\left (\mathbf{X}\beta \right ) \\ p\left (y|x\right ) \sim \pi\left (\mu,...\right ) \]
link function: \(g()\)
likelihood function: \(\pi()\)
\[ \mathbf{X}\beta \]
“linear in the parameters”: the linear prediction is just a linear combination of the columns of the design matrix, \(\mathbf{X}\)
❗But no one said the columns had to be “linear”
Link Functions (Generalized Linear Models)
Feature Engineering (Additive Models)
Design Matrix: a matrix containing all of the inputs to your model after processing (e.g. one-hot encoding, polynomial basis expansion)
\[ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 0 \\ 1 & 1 \\ \end{bmatrix} \to \begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & 2 & 4 & 8\\ 1 & 3 & 9 & 27\\ 1 & 0 & 0 & 0\\ 1 & 1 & 1 & 1\\ \end{bmatrix} \]
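This expansion is easy to reproduce in R (a minimal sketch using base functions):

```r
# Second column of the original design matrix (the first is the intercept)
x <- c(1, 2, 3, 0, 1)

# Cubic polynomial basis expansion: intercept, x, x^2, x^3
X <- cbind(1, x, x^2, x^3)
X  # rows match the expanded matrix above
```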
We don’t want our polynomials to be too wiggly
\[ y_i = \beta_0 + f(x_i) + \epsilon \]
Here, \(y\) is modeled as an intercept plus a smooth function \(f\) of our predictor \(x\).
Note: This is still “linear in the parameters”; we’re just transforming our predictor \(x\) in nonlinear ways
Consider our Piecewise model:
\[ y_i = \beta_0 + f(x_i) + \epsilon \]
where \(f(x) = \sum_{i=1}^q b_i(x)\beta_i\). In this case, the \(b_i(x)\) are functions that create dummy variables indicating which range \(x\) falls in, and the \(\beta_i\) are the means for each range.
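As a sketch, these dummy-variable basis functions can be built in R with `cut()` (the knot locations here are arbitrary, for illustration):

```r
x <- c(1, 2, 3, 0, 1)

# Assign each observation to a range, with knots at 1 and 2
ranges <- cut(x, breaks = c(-Inf, 1, 2, Inf))

# One dummy column per range; regressing y on these columns (no intercept)
# estimates the mean of y within each range
B <- model.matrix(~ ranges - 1)
```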
Or, consider our polynomial regression:
\[ y_i = \beta_0 + f(x_i) + \epsilon \]
where \(f(x) = \sum_{i=1}^q b_i(x)\beta_i\). In this case, the \(b_i(x)\) are functions that raise \(x\) to the \(i^{th}\) power, and the \(\beta_i\) are the corresponding coefficients for each polynomial term.
For \(f(x) = \sum_{i=1}^q b_i(x)\beta_i\), the \(b_i(x)\) are called basis functions. These are transformations of our original predictor \(x\), and each basis function is added as a predictor (a column) to our design matrix.
\[ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 0 \\ 1 & 1 \\ \end{bmatrix} \to \begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & 2 & 4 & 8\\ 1 & 3 & 9 & 27\\ 1 & 0 & 0 & 0\\ 1 & 1 & 1 & 1\\ \end{bmatrix} \]
When fitting Additive Models and choosing \(f(x)\) to transform our predictor \(x\), we have to consider a few things. For example, for our Piecewise functions, we needed to choose knots.
Knots are the points at which the piecewise function is allowed to change
Currently, the functions are not required to be continuous at the knots. But we could require that. We could also require that the first and second derivatives be equal at the knots so that the function is smooth.
\[ \text{Income} = 35 + 1.5 \cdot \text{years of ed} + 2.5 \cdot (\text{years of ed} - 13) + \epsilon \]
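One way to read the second slope term is as a “hinge” feature, \(\max(x - 13, 0)\): it is zero up to the knot at 13 and increases linearly afterward, so the fitted line changes slope at the knot without jumping. A sketch in R (treating the slide’s coefficients as given, and ignoring the noise term):

```r
years <- 10:16

# Hinge basis: 0 below the knot at 13, (years - 13) above it
hinge <- pmax(years - 13, 0)

# Predicted income: slope 1.5 below the knot, 1.5 + 2.5 = 4 above it,
# and the two pieces meet exactly at years = 13 (continuity)
income <- 35 + 1.5 * years + 2.5 * hinge
```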
💡 We can choose to write our features in a way that enforces continuity
💡 We can choose to write our features in a way that enforces smoothness
Regression Splines: continuous, smooth, piecewise polynomial functions of a predictor \(x\) are used to predict \(y\)
Regression splines allow us to fit flexible models without needing a very high-degree polynomial.
❓ How do we create these regression splines that are continuous at each knot, and smooth?
Choosing the right Basis Functions.
Basis Functions \(b_i(x)\) allow us to create smooth functions by multiplying each basis function by a coefficient and adding them together
\[ f(x) = \sum_{i=1}^q b_i(x)\beta_i \]
But remember: we don’t want our functions to be too wiggly.
Wiggliness: how quickly a function changes across its range of values
\[ \text{wiggle} = \int \left[ f''(x) \right]^2 \, dx \]
Second derivative: the rate of change of the rate of change (derivative of a derivative)
💡 In a wiggly function, what would you expect the second derivative to be like?
We want our function to be just wiggly enough to capture the true relationship in the data, but not so wiggly that it overfits
One of the ways we can control the amount of wiggliness in our function is by choosing the number of basis functions.
\[ \text{Penalized Likelihood} = \text{Likelihood} - \lambda \cdot \text{Wiggle} \]
We can also penalize the model for being too wiggly.
We want to choose \(\lambda\) in a way that allows us to fit the appropriate function (low bias) but does not overfit (low variance).
We have two main levers to control wiggliness:
\(\lambda\): regularization strength
\(k\): number of bases
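In mgcv’s notation, both levers appear directly in `s()` (a sketch on simulated data; leaving `sp` unset lets mgcv estimate \(\lambda\) itself):

```r
library(mgcv)

# Simulated data, for illustration only
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

# k sets the number of basis functions; sp fixes the smoothing penalty lambda
fit_wiggly <- gam(y ~ s(x, k = 20, sp = 0))     # no penalty: as wiggly as k allows
fit_smooth <- gam(y ~ s(x, k = 20, sp = 1000))  # heavy penalty: close to a straight line
```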
We’re not limited to one predictor:
\[ y_i = \beta_0 + f_1(x_{1i}) + f_2(x_{2i}) + ... + f_n(x_{ni}) + \epsilon \]
We can use smooth functions of multiple predictor variables to create an additive model.
Generalized Additive Models just apply the same generalization as Generalized Linear Models:
\[ \mathbb{E}\left (y | \mathbf{X}\right ) = \mu = g^{-1}\left (\mathbf{X}\beta \right ) \\ p\left (y|x\right ) \sim \pi\left (\mu,...\right ) \]
link function: \(g()\)
likelihood function: \(\pi()\)
But now we know \(\mathbf{X}\beta\) has the form \(\beta_0 + \sum_{i =1}^q f_i(x_{i})\)
We’re going to use the mgcv package in R to fit GAMs
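The summary below could be produced by a call along these lines (the data frame `car_data` and its column names are assumptions inferred from the formula in the output):

```r
library(mgcv)

# car_data: hypothetical data frame with city.mpg, fuel, hp, and weight columns
fit <- gam(city.mpg ~ fuel + s(hp, sp = 1000) + s(weight),
           data = car_data, family = gaussian(link = "identity"))
summary(fit)
```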
```
Family: gaussian
Link function: identity

Formula:
city.mpg ~ fuel + s(hp, sp = 1000) + s(weight)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  32.2914     0.6502   49.66   <2e-16 ***
fuelgas      -7.8205     0.6968  -11.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
            edf Ref.df     F p-value
s(hp)     1.004  1.007 31.78  <2e-16 ***
s(weight) 5.791  7.008 44.92  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.868   Deviance explained = 87.3%
GCV = 5.975   Scale est. = 5.7162   n = 203
```
In practice, we want to fit our models with `method = "REML"`. REML is good for selecting the optimal level of smoothness and can handle a wide variety of effects (e.g. mixed effects) with good numerical stability.
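A sketch, with `car_data` a hypothetical data frame matching the formula above:

```r
library(mgcv)

# car_data: hypothetical data frame with city.mpg, fuel, hp, and weight columns
fit_reml <- gam(city.mpg ~ fuel + s(hp) + s(weight),
                data = car_data, method = "REML")
```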
Report edf and significance
Show Partial Effect Plots
Use the `marginaleffects` package
## GAMs in R
We can also fit generalized versions of GAMs, using non-Gaussian likelihoods via the `family` argument (e.g. `family = binomial` for binary outcomes).